Bacterial whole-genome analysis

Jelmer Poelstra

MCIC Wooster, Ohio State University, USA

2024-02-08

Bacterial isolate whole-genome sequencing

Introduction

Today, we will assemble a bacterial genome from one of the pairs of FASTQ files that you copied yesterday morning:

total 6.1G
-rw-r--r-- 1 jelmer PAS0471 205M Feb  7 11:21 SM04_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 242M Feb  7 11:21 SM04_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 188M Feb  7 11:21 SM1030_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 221M Feb  7 11:21 SM1030_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 187M Feb  7 11:21 SM1031_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 221M Feb  7 11:21 SM1031_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 187M Feb  7 11:21 SM1038_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 224M Feb  7 11:21 SM1038_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 176M Feb  7 11:21 SM1042_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 199M Feb  7 11:21 SM1042_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 172M Feb  7 11:21 SM109_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 198M Feb  7 11:21 SM109_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 157M Feb  7 11:21 SM155_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 181M Feb  7 11:21 SM155_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 155M Feb  7 11:21 SM156_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 185M Feb  7 11:21 SM156_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 146M Feb  7 11:21 SM181_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 159M Feb  7 11:21 SM181_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 195M Feb  7 11:21 SM190_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 241M Feb  7 11:21 SM190_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 192M Feb  7 11:21 SM191_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 234M Feb  7 11:21 SM191_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 194M Feb  7 11:21 SM205_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 242M Feb  7 11:21 SM205_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 176M Feb  7 11:21 SM207_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 210M Feb  7 11:21 SM207_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 137M Feb  7 11:21 SM226_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 165M Feb  7 11:21 SM226_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 192M Feb  7 11:21 SM51_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 224M Feb  7 11:21 SM51_R2.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 193M Feb  7 11:21 SM914_R1.fastq.gz
-rw-r--r-- 1 jelmer PAS0471 226M Feb  7 11:21 SM914_R2.fastq.gz
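
Each sample in the listing above has one `_R1` and one `_R2` file. As a minimal sketch of how these could be grouped into pairs before assembly, the Python snippet below matches file names by sample ID. The function name `pair_fastq_files` and the pattern are illustrative assumptions based only on the naming scheme shown in the listing.

```python
import re
from collections import defaultdict

# Matches names such as "SM04_R1.fastq.gz": a sample ID, then the read direction (1 or 2).
# This pattern is an assumption derived from the file names in the listing.
FASTQ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq\.gz$")


def pair_fastq_files(filenames):
    """Group FASTQ file names into {sample_id: (R1_file, R2_file)}."""
    by_sample = defaultdict(dict)
    for name in filenames:
        match = FASTQ_PATTERN.match(name)
        if match is None:
            continue  # Skip anything that does not follow the naming scheme
        by_sample[match["sample"]][match["read"]] = name

    pairs = {}
    for sample, reads in sorted(by_sample.items()):
        if set(reads) != {"1", "2"}:
            raise ValueError(f"Sample {sample} is missing its R1 or R2 file")
        pairs[sample] = (reads["1"], reads["2"])
    return pairs


# A few file names taken from the listing
example_files = [
    "SM04_R1.fastq.gz",
    "SM04_R2.fastq.gz",
    "SM1030_R1.fastq.gz",
    "SM1030_R2.fastq.gz",
]
pairs = pair_fastq_files(example_files)
for sample, (r1, r2) in pairs.items():
    print(sample, r1, r2)
```

Raising an error for incomplete pairs is a design choice here: a missing mate file is usually a copying problem that is better caught before an assembly is started.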

Why sequence full genomes?

Best possible resolution for, e.g.:

  • Phylogenetics
  • Population structure
  • Epidemiology
  • Association analyses like GWAS

The only way to:

  • Comprehensively detect virulence genes, AMR genes, etc.
  • Comprehensively compare gene content among genomes
  • Perform a pangenome analysis

Epidemiology

From Weisberg et al. 2021

Pangenomes

From Koonin et al. 2021

Sequencing technologies

Genome assemblies through time

From Koonin et al. 2021

Most frequent sequencing technologies

The current best way?

From Wick et al. 2023

The current best way? (cont.)

From Wick et al. 2023

A few notes on Illumina sequencing

Sequencing adapters - recap

After library prep, each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:


Paired-end vs. single-end sequencing

In Illumina sequencing, DNA fragments can be sequenced from both ends as shown below — this is called “paired-end” (PE) sequencing:


When sequencing is instead single-end (SE), no reverse read is produced:

Paired-end sequencing

  • Paired-end sequencing is a way to effectively increase the read length.

  • The total size of the biological DNA fragment (without adapters) is often called the insert size:

Insert size variation

Insert size depends on the target size of the library prep protocol, and it also varies within a library because size selection has limited precision. In some cases, the insert size can be:

  • Shorter than the combined read length, leading to overlapping reads:

  • Shorter than the length of a single read, leading to “adapter read-through” (i.e., the ends of the resulting reads will consist of adapter sequence, which should be removed):
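
The two cases above follow from simple arithmetic on the insert size and the read length. The Python sketch below makes that explicit; the function name `classify_insert` and the example insert sizes are hypothetical, and the 300 bp read length is chosen only because it matches a 2x300 bp run.

```python
def classify_insert(insert_size, read_length):
    """Describe how a read pair covers a fragment with the given insert size (in bp)."""
    if insert_size < read_length:
        # Each read runs past the end of the insert and into adapter sequence
        extra = read_length - insert_size
        return f"adapter read-through ({extra} bp beyond the insert per read)"
    if insert_size < 2 * read_length:
        # The forward and reverse reads meet and share some bases
        overlap = 2 * read_length - insert_size
        return f"overlapping reads ({overlap} bp overlap)"
    # The reads do not meet, so the middle of the insert is not sequenced
    gap = insert_size - 2 * read_length
    return f"non-overlapping reads ({gap} bp unsequenced in the middle)"


# Hypothetical insert sizes, with 300 bp reads
for size in (200, 450, 700):
    print(size, "->", classify_insert(size, read_length=300))
```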

Illumina error profile

Illumina error profile (cont.)

  • The different templates within a cluster get out of sync because occasionally:

    • They miss a base incorporation

    • They incorporate two bases at once


  • Base incorporation may also terminate before the end of the template is reached

This error profile is why, for Illumina:

  • There are hard limits on read lengths
  • Base quality scores typically decrease along the read
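
Base quality scores are stored in FASTQ files as one character per base. As a rough sketch, assuming the Phred+33 encoding used by current Illumina instruments, the Python snippet below decodes a quality string and converts each score Q to an error probability with P = 10^(-Q/10). The quality string is a made-up example, not taken from our data.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Probability that a base call with this Phred score is wrong."""
    return 10 ** (-phred_score / 10)


# Made-up quality string in which quality drops towards the end of the read
example_quality = "IIIIHHGF5+#"
scores = decode_qualities(example_quality)
for position, score in enumerate(scores, start=1):
    print(position, score, f"{error_probability(score):.4f}")
```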

Our dataset

Our pathogen

  • Pseudomonas syringae causes disease in a wide range of host plants, from Solanaceae and Leguminosae plants to citrus and stone fruit trees.

  • Pseudomonas syringae pv. syringae (Pss) is an emerging phytopathogen that causes Pseudomonas leaf spot (PLS) disease in pepper plants (Capsicum annuum var. annuum).

  • Copper-based antimicrobials are used as chemical control methods for Pss in peppers. This has resulted in the emergence of copper-resistant strains of Pss.

Our dataset

  • From Ranjit et al. in prep, Ohio State University

  • 16 Pss samples were isolated from pepper plants harboring characteristic PLS symptoms in Ohio between 2013 and 2021.

  • Illumina MiSeq sequencing of these isolates: 2x300 bp reads
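
For a sense of scale, the average sequencing depth of an isolate can be estimated as the total number of sequenced bases divided by the genome size. The Python sketch below uses placeholder numbers: the read-pair count is hypothetical, and the genome size of 6 Mbp is only an assumed round figure for a Pseudomonas syringae genome, not a value from this dataset.

```python
def estimated_depth(read_pairs, read_length, genome_size):
    """Average sequencing depth: total sequenced bases divided by genome size."""
    total_bases = read_pairs * 2 * read_length  # Two reads per pair
    return total_bases / genome_size


# Hypothetical values: 1 million read pairs of 2x300 bp and an assumed 6 Mbp genome
depth = estimated_depth(read_pairs=1_000_000, read_length=300, genome_size=6_000_000)
print(f"Estimated depth: {depth:.0f}x")
```

This is an upper bound, since it treats every read as full-length and ignores bases lost to trimming.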

Variable genome vs. disease severity

Our workflow

The steps involved

  1. QC and preprocessing of the FASTQ files
  2. Assembly and assembly QC
  3. Annotation
  4. Pangenomics and phylogenetics (not included)

Workflow overview

From Petit & Read 2020

Questions?